Abstract: The computing environment is changing with the spread of resource-constrained devices
such as smartphones and other smart devices. Because our daily life is deeply intertwined with
ubiquitous networks, the importance of security is growing. A lightweight encryption algorithm
is essential for secure communication between such resource-constrained devices, and many
researchers have been investigating this field. Recently, a lightweight block cipher called LEA
was proposed. LEA was originally targeted at efficient implementation on microprocessors: it is
fast when implemented in software and has a small memory footprint. Reflecting recent
technology, all required calculations use 32-bit wide operations. In addition, the algorithm
consists not of complex S-Box-like structures but of simple addition, rotation, and XOR
operations. To the best of our knowledge, this paper is the first report on a comprehensive
hardware implementation of LEA. We present various hardware structures and their
implementation results according to key size. Even though LEA was originally targeted at
software efficiency, it also shows high efficiency when implemented in hardware.

1. Introduction

Recent improvements in semiconductor technology have made the computing environment mobile
and accelerated the shift to a ubiquitous era. The use of small mobile devices is growing
explosively, and the importance of security is increasing daily. One of the essential ingredients
of smart device security is a block cipher, and small mobile devices require lightweight,
energy-efficient implementation techniques.

Techniques for securing resource-constrained devices such as RFID (Radio-frequency Identification)
tags have been proposed. In 2005, Lim and Korkishko [1] presented a lightweight block cipher called
mCrypton that encrypts plaintext into ciphertext by using simple operations on a 4 by 4 nibble
(4-bit) matrix, such as substitution (S-Box), permutation, transposition, and key addition (XOR).
The following year, Hong et al. [2] proposed a lightweight block cipher called HIGHT, which has a
Feistel structure and operates with simple calculations such as XOR, addition, subtraction, and
rotation. In 2007, Bogdanov et al. [3] introduced PRESENT, which consists of substitution,
permutation, and XOR. In 2009, KATAN and KTANTAN were proposed by De Cannière et al. [4]. KATAN
divides the plaintext into two parts and stores them in two registers, and the output of each
register's non-linear function is stored in the least significant bit (LSB) of the other register.
KTANTAN, on the other hand, is a fixed-key version of KATAN and has a different key scheduling
scheme. In the same year, the rotor-based HummingBird was proposed by Revere Security. However,
this algorithm was shown to be vulnerable to chosen-IV attacks and chosen-message attacks. Two
years later, HummingBird2 [5], an improved version of HummingBird, was proposed. In 2011,
Guo et al. [6] proposed a lightweight cipher called LED, which has a structure similar to AES but
does not perform key scheduling.

Both lightweight block ciphers and methods to optimize legacy block ciphers have been studied. 
Moradi et al. [7] optimized AES and reduced the gate count to 2,400 GE (gate equivalent). 
Poschmann et al. [8] implemented DES with 1,848 GE. 

Recently, the Electronics and Telecommunications Research Institute in Korea announced a
new lightweight block cipher called LEA [9]. The focus of the LEA design is "software-oriented
lightweightness" for resource-constrained small devices. It is intended to have a small code size
and consume low power, and it is therefore extremely efficient when implemented in software. LEA
supports three key sizes (128, 192, and 256 bits) and a 128-bit block size. Every inner operation
of LEA is 32 bits wide, since 32-bit microprocessors are more popular than 8-bit ones these days.
Further, it does not employ a complex operation such as an S-Box, and only uses simple operations
such as addition, rotation, and XOR (ARX).

Usually, a small chip size and reasonably fast encryption are preferred for cryptographic hardware
for small devices in resource-constrained environments, such as RFID tags or smart meters for smart
grids. In this paper, we propose several methods to optimize LEA hardware for all key sizes and
present implementation results in terms of time and chip area cost. This work is the first to
study a comprehensive hardware implementation of LEA. LEA was originally designed for software
implementation, but we aim to demonstrate that it is also efficient when implemented in hardware.

The rest of this paper is organized as follows: We introduce the LEA algorithm in Section 2, and
then present elemental techniques for implementing LEA in hardware in Section 3. Section 4 presents
hardware structures for the 128-, 192-, and 256-bit key versions of LEA, and the corresponding
implementation results are presented in Section 5. Section 6 compares LEA with other encryption
algorithms, and we conclude this paper in Section 7.

Sensors 2014, 14

2. LEA Algorithm 

In this section, we introduce the LEA block cipher. LEA has 128-bit message blocks and 128-,
192-, or 256-bit keys. We denote the versions of this algorithm as LEA-128, LEA-192, and LEA-256
according to key length.



2.1. Notations

We present the notations required to explain the LEA algorithm, together with their descriptions,
in Table 1.

Table 1. Notations used to explain the LEA algorithm.

Symbol                                   Meaning
$P$                                      128-bit plaintext. $P = P_0 \| P_1 \| P_2 \| P_3$; each $P_n$ is 32 bits.
$C$                                      128-bit ciphertext. $C = C_0 \| C_1 \| C_2 \| C_3$; each $C_n$ is 32 bits.
$L(x)$                                   Length of the bit sequence $x$.
$K$                                      Master key. $K = K_0 \| K_1 \| \dots \| K_n$; $n = 3$ where $L(K) = 128$, $n = 5$ where $L(K) = 192$, and $n = 7$ where $L(K) = 256$.
$X^i$                                    Intermediate value of the $i$-th encryption state. $X^i = X^i_0 \| X^i_1 \| X^i_2 \| X^i_3$ where $0 \le i \le r$. Each $X^i_n$ is 32 bits.
$T^i$                                    Intermediate value of the $i$-th key schedule state. $T^i = T^i_0 \| T^i_1 \| \dots \| T^i_n$ where $0 \le i < r$. Each $T^i_n$ is 32 bits.
$\delta_0, \delta_1, \dots, \delta_n$    Constant values used for the key schedule; $n = 3$ where $L(K) = 128$, $n = 5$ where $L(K) = 192$, and $n = 7$ where $L(K) = 256$.
$r$                                      Number of round iterations; $r = 24$ where $L(K) = 128$, $r = 28$ where $L(K) = 192$, and $r = 32$ where $L(K) = 256$.
$RK^i$                                   192-bit round key used for the $i$-th round. $RK^i = RK^i_0 \| RK^i_1 \| RK^i_2 \| RK^i_3 \| RK^i_4 \| RK^i_5$ where $0 \le i < r$. Each $RK^i_n$ is 32 bits.
$\oplus$                                 XOR operation.
$\boxplus$                               Addition modulo $2^{32}$.
$\mathrm{ROL}_i(x)$                      $i$-bit left rotation of $x$.
$\mathrm{ROR}_i(x)$                      $i$-bit right rotation of $x$.



2.2. Key Schedule 
2.2.1. Constants 

The key schedules of LEA-128, LEA-192, and LEA-256 use four, six, and eight 32-bit constants,
respectively. The constants are defined as follows:

$$
\begin{aligned}
\delta_0 &= \texttt{C3EFE9DB}_{16}, & \delta_1 &= \texttt{44626B02}_{16},\\
\delta_2 &= \texttt{79E27C8A}_{16}, & \delta_3 &= \texttt{78DF30EC}_{16},\\
\delta_4 &= \texttt{715EA49E}_{16}, & \delta_5 &= \texttt{C785DA0A}_{16},\\
\delta_6 &= \texttt{E04EF22A}_{16}, & \delta_7 &= \texttt{E5C40957}_{16}
\end{aligned} \tag{1}
$$

The constants are generated from the hexadecimal expression of $\sqrt{766965}$, where 76, 69, and
65 are the ASCII codes for "L", "E", and "A".

2.2.2. Key Schedule for 128-Bit Key 

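At the beginning of the LEA-128 key schedule, the key state $T$ is assigned as $T^{-1}_n = K_n$
where $0 \le n < 4$. The key schedule of LEA-128 is defined as follows:

$$
\begin{aligned}
T^{i+1}_0 &\leftarrow \mathrm{ROL}_1\bigl(T^i_0 \boxplus \mathrm{ROL}_i(\delta_{i \bmod 4})\bigr)\\
T^{i+1}_1 &\leftarrow \mathrm{ROL}_3\bigl(T^i_1 \boxplus \mathrm{ROL}_{i+1}(\delta_{i \bmod 4})\bigr)\\
T^{i+1}_2 &\leftarrow \mathrm{ROL}_6\bigl(T^i_2 \boxplus \mathrm{ROL}_{i+2}(\delta_{i \bmod 4})\bigr)\\
T^{i+1}_3 &\leftarrow \mathrm{ROL}_{11}\bigl(T^i_3 \boxplus \mathrm{ROL}_{i+3}(\delta_{i \bmod 4})\bigr)\\
RK^i &\leftarrow (T^i_0, T^i_1, T^i_2, T^i_1, T^i_3, T^i_1)
\end{aligned} \tag{2}
$$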
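The LEA-128 schedule can be sketched in software as follows. This is a minimal Python model, not the hardware discussed later. It assumes the convention of the published LEA specification, in which round key $i$ is read from the key state after the $i$-th update. The names `DELTA`, `rol`, and `key_schedule_128` and the example key words are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF

# Key-schedule constants delta_0..delta_3 (LEA-128 uses only the first four).
DELTA = [0xC3EFE9DB, 0x44626B02, 0x79E27C8A, 0x78DF30EC]


def rol(x, n):
    """Rotate the 32-bit word x left by n bits (n is reduced mod 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def key_schedule_128(key_words):
    """Return 24 round keys, each a list of six 32-bit words."""
    t = list(key_words)  # T_0..T_3 start as K_0..K_3
    round_keys = []
    for i in range(24):
        d = DELTA[i % 4]
        t[0] = rol((t[0] + rol(d, i)) & MASK32, 1)
        t[1] = rol((t[1] + rol(d, i + 1)) & MASK32, 3)
        t[2] = rol((t[2] + rol(d, i + 2)) & MASK32, 6)
        t[3] = rol((t[3] + rol(d, i + 3)) & MASK32, 11)
        # T_1 appears three times in every round key.
        round_keys.append([t[0], t[1], t[2], t[1], t[3], t[1]])
    return round_keys


# Arbitrary example key, given directly as four 32-bit words K_0..K_3.
rk = key_schedule_128([0x3C2D1E0F, 0x78695A4B, 0xB4A59687, 0xF0E1D2C3])
print([f"{w:08x}" for w in rk[0]])
```

Because $T_1$ fills three of the six round-key slots, a hardware design needs only four key registers for LEA-128, which is what the LEA-128 structures in Section 4 exploit.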

2.2.3. Key Schedule for 192-Bit Key 

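The key schedule of LEA-192 also starts by setting $T$ as $T^{-1}_n = K_n$ where $0 \le n < 6$.
The key schedule of LEA-192 is defined as follows:

$$
\begin{aligned}
T^{i+1}_0 &\leftarrow \mathrm{ROL}_1\bigl(T^i_0 \boxplus \mathrm{ROL}_i(\delta_{i \bmod 6})\bigr)\\
T^{i+1}_1 &\leftarrow \mathrm{ROL}_3\bigl(T^i_1 \boxplus \mathrm{ROL}_{i+1}(\delta_{i \bmod 6})\bigr)\\
T^{i+1}_2 &\leftarrow \mathrm{ROL}_6\bigl(T^i_2 \boxplus \mathrm{ROL}_{i+2}(\delta_{i \bmod 6})\bigr)\\
T^{i+1}_3 &\leftarrow \mathrm{ROL}_{11}\bigl(T^i_3 \boxplus \mathrm{ROL}_{i+3}(\delta_{i \bmod 6})\bigr)\\
T^{i+1}_4 &\leftarrow \mathrm{ROL}_{13}\bigl(T^i_4 \boxplus \mathrm{ROL}_{i+4}(\delta_{i \bmod 6})\bigr)\\
T^{i+1}_5 &\leftarrow \mathrm{ROL}_{17}\bigl(T^i_5 \boxplus \mathrm{ROL}_{i+5}(\delta_{i \bmod 6})\bigr)\\
RK^i &\leftarrow (T^i_0, T^i_1, T^i_2, T^i_3, T^i_4, T^i_5)
\end{aligned} \tag{3}
$$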

2.2.4. Key Schedule for 256-Bit Key 

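Likewise, the key schedule of LEA-256 starts by setting $T$ as $T^{-1}_n = K_n$ where
$0 \le n < 8$, and is defined as follows:

$$
\begin{aligned}
T^{i+1}_{6i \bmod 8} &\leftarrow \mathrm{ROL}_1\bigl(T^i_{6i \bmod 8} \boxplus \mathrm{ROL}_i(\delta_{i \bmod 8})\bigr)\\
T^{i+1}_{(6i+1) \bmod 8} &\leftarrow \mathrm{ROL}_3\bigl(T^i_{(6i+1) \bmod 8} \boxplus \mathrm{ROL}_{i+1}(\delta_{i \bmod 8})\bigr)\\
T^{i+1}_{(6i+2) \bmod 8} &\leftarrow \mathrm{ROL}_6\bigl(T^i_{(6i+2) \bmod 8} \boxplus \mathrm{ROL}_{i+2}(\delta_{i \bmod 8})\bigr)\\
T^{i+1}_{(6i+3) \bmod 8} &\leftarrow \mathrm{ROL}_{11}\bigl(T^i_{(6i+3) \bmod 8} \boxplus \mathrm{ROL}_{i+3}(\delta_{i \bmod 8})\bigr)\\
T^{i+1}_{(6i+4) \bmod 8} &\leftarrow \mathrm{ROL}_{13}\bigl(T^i_{(6i+4) \bmod 8} \boxplus \mathrm{ROL}_{i+4}(\delta_{i \bmod 8})\bigr)\\
T^{i+1}_{(6i+5) \bmod 8} &\leftarrow \mathrm{ROL}_{17}\bigl(T^i_{(6i+5) \bmod 8} \boxplus \mathrm{ROL}_{i+5}(\delta_{i \bmod 8})\bigr)\\
RK^i &\leftarrow (T^i_{6i \bmod 8}, T^i_{(6i+1) \bmod 8}, \dots, T^i_{(6i+5) \bmod 8})
\end{aligned} \tag{4}
$$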
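The sliding window that distinguishes the LEA-256 schedule can be illustrated with a short Python sketch. It assumes the indexing of the published LEA specification, in which round $i$ updates the six key words at positions $(6i + k) \bmod 8$ for $k = 0, \dots, 5$ and reads the round key from the updated words. The names `key_schedule_256` and `window` and the example key are hypothetical.

```python
MASK32 = 0xFFFFFFFF

DELTA = [0xC3EFE9DB, 0x44626B02, 0x79E27C8A, 0x78DF30EC,
         0x715EA49E, 0xC785DA0A, 0xE04EF22A, 0xE5C40957]
ROT = [1, 3, 6, 11, 13, 17]  # output rotation applied to the k-th scheduled word


def rol(x, n):
    """Rotate the 32-bit word x left by n bits (n is reduced mod 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def window(i):
    """Indices of the six key words touched in round i."""
    return [(6 * i + k) % 8 for k in range(6)]


def key_schedule_256(key_words):
    """Return 32 round keys, each a list of six 32-bit words."""
    t = list(key_words)  # T_0..T_7 start as K_0..K_7
    round_keys = []
    for i in range(32):
        d = DELTA[i % 8]
        rk = []
        for k, j in enumerate(window(i)):
            t[j] = rol((t[j] + rol(d, i + k)) & MASK32, ROT[k])
            rk.append(t[j])
        round_keys.append(rk)
    return round_keys


rk = key_schedule_256(list(range(1, 9)))  # arbitrary example key words
print(window(0), window(1))
```

Round 0 touches words 0 to 5 and round 1 touches words 6, 7, 0, 1, 2, 3, so consecutive rounds overlap in the key state. This overlap is why the LEA-256 hardware structures in Section 4 buffer the scheduled keys in registers.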

2.3. Encryption Procedure

As described in Section 2.1, LEA-128/192/256 iterates 24/28/32 rounds. Unlike AES [10] or
HIGHT [2], which require a special final round function, LEA uses only one round function. Figure 1
shows the round function of LEA. At the beginning of the encryption, the intermediate state $X$ is
set as $X^0_n = P_n$ where $0 \le n < 4$, and the following round function is executed $r$ times:

$$
\begin{aligned}
X^{i+1}_0 &\leftarrow \mathrm{ROL}_9\bigl((X^i_0 \oplus RK^i_0) \boxplus (X^i_1 \oplus RK^i_1)\bigr)\\
X^{i+1}_1 &\leftarrow \mathrm{ROR}_5\bigl((X^i_1 \oplus RK^i_2) \boxplus (X^i_2 \oplus RK^i_3)\bigr)\\
X^{i+1}_2 &\leftarrow \mathrm{ROR}_3\bigl((X^i_2 \oplus RK^i_4) \boxplus (X^i_3 \oplus RK^i_5)\bigr)\\
X^{i+1}_3 &\leftarrow X^i_0
\end{aligned} \tag{5}
$$

The final state $C_n = X^r_n$, where $0 \le n < 4$, is used as the ciphertext.

Figure 1. Round function of LEA. 
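The round function translates directly into a few lines of Python. The sketch below is a behavioural model only; `lea_round` and `encrypt` are illustrative names, and the round keys are passed in as lists of six 32-bit words.

```python
MASK32 = 0xFFFFFFFF


def rol(x, n):
    """Rotate the 32-bit word x left by n bits (n is reduced mod 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def ror(x, n):
    """Rotate the 32-bit word x right by n bits."""
    return rol(x, 32 - (n % 32))


def lea_round(x, rk):
    """One LEA round: x is [X0, X1, X2, X3], rk is [RK0, ..., RK5]."""
    x0, x1, x2, x3 = x
    return [
        rol(((x0 ^ rk[0]) + (x1 ^ rk[1])) & MASK32, 9),
        ror(((x1 ^ rk[2]) + (x2 ^ rk[3])) & MASK32, 5),
        ror(((x2 ^ rk[4]) + (x3 ^ rk[5])) & MASK32, 3),
        x0,  # the old X0 passes through unchanged
    ]


def encrypt(plain_words, round_keys):
    """Apply the same round function once per round key; no special final round."""
    x = list(plain_words)
    for rk in round_keys:
        x = lea_round(x, rk)
    return x
```

Each round needs six XORs, three 32-bit additions, and three fixed rotations. The rotations cost no gates in hardware, so the adders and XORs are the logic that an area-optimized design shares across clock cycles.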

3. Elemental Hardware Structures for LEA Calculation

This section describes elemental hardware structures used for implementing LEA hardware. 

3.1. Constant Value Schedule Logic for Speed-Optimized Implementation 

LEA employs several constants for key scheduling. To design the constant schedule logic, the usage
patterns of the constants need to be analyzed. In Equation (2), the constant values used for the
$i$-th round are $\mathrm{ROL}_i(\delta_{i \bmod 4})$, $\mathrm{ROL}_{i+1}(\delta_{i \bmod 4})$,
$\mathrm{ROL}_{i+2}(\delta_{i \bmod 4})$, and $\mathrm{ROL}_{i+3}(\delta_{i \bmod 4})$. At the
$i$-th round, the $(i \bmod 4)$-th constant is chosen; in other words, constants are used in
increasing order, i.e., $\delta_0, \delta_1, \delta_2, \delta_3, \delta_0, \dots$. After a constant
is chosen, it is rotated $i$, $i+1$, $i+2$, and $i+3$ bits to the left.

Figure 2 shows the intuitive structure of the constant schedule logic of the 128-bit speed-optimized
version of LEA hardware. The speed-optimized version executes one round per clock cycle, so it must
generate all four constants required for a round at once. Constants $\delta_0$ to $\delta_3$ are
stored in 32-bit flip-flops $c_0$ to $c_3$. In every round, each value moves to the next flip-flop.
Since the $i$-th round uses a constant rotated $i$ times (and $i+1$, $i+2$, and $i+3$ times), each
value is also rotated 1 bit to the left in every round. The constant used for the $i$-th round is
located in the $c_0$ register, and its value is exactly $\mathrm{ROL}_i(\delta_{i \bmod 4})$. The
remaining $\mathrm{ROL}_{i+1}(\delta_{i \bmod 4})$, $\mathrm{ROL}_{i+2}(\delta_{i \bmod 4})$, and
$\mathrm{ROL}_{i+3}(\delta_{i \bmod 4})$ are generated from it by the corresponding
$\mathrm{ROL}_1$, $\mathrm{ROL}_2$, and $\mathrm{ROL}_3$ operations. None of the rotations in the
figure consumes any logic gates, because a fixed rotation is implemented simply by crossing wires.
Thus, the logic requires only 128 flip-flops.

Figure 2. Constant scheduling logic structure for speed-optimized LEA hardware. 
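The behaviour of this register chain can be checked with a small Python model. It is a sketch under one assumption that the text does not pin down: every register-to-register move, including the wrap-around from $c_0$ to $c_3$, carries a 1-bit left rotation. The function name `constant_schedule_speed` is hypothetical.

```python
MASK32 = 0xFFFFFFFF
DELTA = [0xC3EFE9DB, 0x44626B02, 0x79E27C8A, 0x78DF30EC]


def rol(x, n):
    """Rotate the 32-bit word x left by n bits (n is reduced mod 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def constant_schedule_speed(rounds=24):
    """Yield the four constants of each round (one round per clock cycle)."""
    c = list(DELTA)  # registers c0..c3 loaded with delta_0..delta_3
    for _ in range(rounds):
        # c0 holds ROL_i(delta_{i mod 4}); the other three values are fixed
        # rotations of c0, which are only wiring in hardware.
        yield [c[0], rol(c[0], 1), rol(c[0], 2), rol(c[0], 3)]
        # Clock edge: each register takes its neighbour's value rotated left by 1.
        c = [rol(c[1], 1), rol(c[2], 1), rol(c[3], 1), rol(c[0], 1)]


for i, consts in enumerate(constant_schedule_speed()):
    expected = [rol(DELTA[i % 4], i + k) for k in range(4)]
    assert consts == expected  # matches the constants required by Equation (2)
```

The model stores only four 32-bit words, in line with the 128 flip-flops stated above.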

3.2. Constant Value Schedule Logic for Area-Optimized Implementation

To minimize the number of gates required, some logic gates are shared and used iteratively within
a round. In an area-optimized implementation, one round can therefore be split into several clock
cycles, and the four constants must be generated one by one within a round. The intuitive structure
of the constant scheduling logic is depicted in Figure 3. At the beginning of a round, $c_0$ is fed
with $\mathrm{ROL}_i(\delta_{i \bmod 4})$ from $c_1$. The value is passed to the key scheduling
logic through the first path of the MUX. For the remaining clock cycles of the round,
$\mathrm{ROL}_{i+1}(\delta_{i \bmod 4})$, $\mathrm{ROL}_{i+2}(\delta_{i \bmod 4})$, and
$\mathrm{ROL}_{i+3}(\delta_{i \bmod 4})$ are fed to the key scheduling logic using the second,
third, and fourth paths of the MUX.

An alternative logic structure for area-optimized LEA is depicted in Figure 4. The 32-bit constant
in $c_0$ is fed to the key scheduling logic. When the round counter is increased, the upper path of
the MUX is used, which moves $\mathrm{ROL}_i(\delta_{i \bmod 4})$ from $c_1$ to the $c_0$ register.
The remaining constant values used for the $i$-th round,
$\mathrm{ROL}_{i+1}(\delta_{i \bmod 4})$, $\mathrm{ROL}_{i+2}(\delta_{i \bmod 4})$, and
$\mathrm{ROL}_{i+3}(\delta_{i \bmod 4})$, are generated during the remaining three clock cycles of
the round using the lower path of the MUX. With this structure, the cost of the four-input MUX is
reduced to that of a two-input MUX. Moreover, the rotation logic before the $c_3$ register differs
from that in Figure 3. At the final state of a round, $c_0$ holds
$\mathrm{ROL}_{i+3}(\delta_{i \bmod 4})$. For the register to hold
$\mathrm{ROL}_{i+4}(\delta_{i \bmod 4})$ after four rounds, the value of $c_0$ should be rotated to
the right twice before it re-enters the chain.

Figure 3. Intuitive constant scheduling logic structure for area-optimized LEA hardware. 

Figure 4. Alternative constant scheduling logic structure for area-optimized LEA hardware.
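The alternative structure can be modelled cycle by cycle in Python. This sketch assumes four clock cycles per round and assumes that the moves $c_3 \to c_2 \to c_1 \to c_0$ each rotate left by one bit; only the right rotation by two on the path from $c_0$ back to $c_3$ is taken from the text. The name `constant_schedule_area` is hypothetical.

```python
MASK32 = 0xFFFFFFFF
DELTA = [0xC3EFE9DB, 0x44626B02, 0x79E27C8A, 0x78DF30EC]


def rol(x, n):
    """Rotate the 32-bit word x left by n bits (n is reduced mod 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def ror(x, n):
    """Rotate the 32-bit word x right by n bits."""
    return rol(x, 32 - (n % 32))


def constant_schedule_area(rounds=24):
    """Yield one constant per clock cycle, four cycles per round."""
    c = list(DELTA)  # registers c0..c3
    for _ in range(rounds):
        for cycle in range(4):
            yield c[0]                 # c0 always feeds the key schedule
            if cycle < 3:
                c[0] = rol(c[0], 1)    # lower MUX path: rotate c0 in place
        # Round boundary (upper MUX path): c1 moves into c0. c0 was already
        # rotated three times, so it re-enters at c3 rotated right by two,
        # giving the same net +1 per round as the other registers.
        c = [rol(c[1], 1), rol(c[2], 1), rol(c[3], 1), ror(c[0], 2)]


stream = list(constant_schedule_area())
for i in range(24):
    for k in range(4):
        assert stream[4 * i + k] == rol(DELTA[i % 4], i + k)
```

Only the emitted sequence is checked against the constants required by the LEA-128 key schedule; the exact placement of the per-round rotation inside the chain is a modelling choice.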

4. Proposed Hardware Structure of LEA

In this section, we describe hardware implementation methods according to the three key sizes and
the optimization goal (speed or area). Even though the three key versions of LEA use the same
round function, their key scheduling algorithms are different. Because the key schedules have
different structures, the different versions cannot share the same key scheduling logic. The
following subsections therefore describe each LEA implementation with a focus on the key
scheduling method. To identify each version by key size and optimization goal, each version is
denoted as LEA-KEYSIZE-OPTIMIZATION_GOAL (e.g., LEA-128-SPEED refers to the 128-bit
version of the LEA implementation with the target of speed improvement).



4.1. LEA Implementation Using 128-Bit Key 

4.1.1. LEA-128-AREA-1

Figure 5 shows the data path of LEA-128-AREA-1. The left side of the data path handles the round
function and the right side handles the key scheduling. Twelve 32-bit registers are used: $x_0$ to
$x_3$ hold the internal state, $t_0$ to $t_3$ are key registers, and the remaining four registers,
$c_0$ to $c_3$, are constant registers.

Plaintext words $X_0$ to $X_3$ are supplied to registers $x_0$ to $x_3$ in reverse order through
the leftmost path of PMUX, and keys $T_0$ to $T_3$ are shifted in using the upper path of KMUX and
stored in registers $t_0$ to $t_3$. In each round, four clocks are required to schedule the keys
and three clocks are required to update the state. The keys in the 32-bit registers are scheduled
one by one. In accordance with Equation (2), the key in register $t_0$ is added to a constant,
rotated left by the specified amount, and then stored in register $t_3$. After the four clocks of
the key scheduling cycle, the round function begins to run. According to Equation (5), a round
repeats two XORs and one addition three times. For the area-optimized version, we reduced the area
by sharing these operations: $(X_2, X_3)$, $(X_1, X_2)$, and $(X_0, X_1)$ are sequentially fed to
the two XORs, and the two results are added. Scheduled round keys are supplied from registers $t_0$
to $t_3$. Since $T_1$ is always required at the input of one XOR, the output of $t_1$ is directly
connected to that input. The remaining outputs of $t_0$, $t_2$, and $t_3$ are selected by RKMUX, so
keys are supplied in $(RK_0, RK_1)$, $(RK_2, RK_1)$, and $(RK_3, RK_1)$ order. The output of the
adder is then fed to three rotation logics; one of them is chosen according to the clock cycle, and
the result is stored in register $x_0$. In this case, 7 clock cycles are required for a round, so
encryption completes in $7 \times 24 = 168$ clock cycles, excluding cycles for input and output.

Figure 5. Datapath of LEA-128-AREA-1.
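The sharing of one XOR-XOR-ADD unit over three clock cycles can be sketched in Python and checked against the parallel round function. The register arrangement is an assumption made for illustration: the four state registers are modelled as a shift register in which the shared unit always reads $x_2$ and $x_3$ and writes its rotated result into $x_0$. The names `lea_round` and `lea_round_shared` are hypothetical, and round keys are indexed as in Equation (5).

```python
MASK32 = 0xFFFFFFFF


def rol(x, n):
    """Rotate the 32-bit word x left by n bits (n is reduced mod 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def ror(x, n):
    """Rotate the 32-bit word x right by n bits."""
    return rol(x, 32 - (n % 32))


def lea_round(x, rk):
    """Reference: all three additions computed in parallel."""
    x0, x1, x2, x3 = x
    return [rol(((x0 ^ rk[0]) + (x1 ^ rk[1])) & MASK32, 9),
            ror(((x1 ^ rk[2]) + (x2 ^ rk[3])) & MASK32, 5),
            ror(((x2 ^ rk[4]) + (x3 ^ rk[5])) & MASK32, 3),
            x0]


def lea_round_shared(x, rk):
    """Three-cycle round with one shared XOR-XOR-ADD unit."""
    regs = list(x)  # registers x0..x3
    steps = [  # (key pair, output rotation) selected in each clock cycle
        ((rk[4], rk[5]), lambda v: ror(v, 3)),  # cycle 1: pair (X2, X3)
        ((rk[2], rk[3]), lambda v: ror(v, 5)),  # cycle 2: pair (X1, X2)
        ((rk[0], rk[1]), lambda v: rol(v, 9)),  # cycle 3: pair (X0, X1)
    ]
    for (ka, kb), rotate in steps:
        s = ((regs[2] ^ ka) + (regs[3] ^ kb)) & MASK32
        # Shift the state by one register and write the result into x0.
        regs = [rotate(s), regs[0], regs[1], regs[2]]
    return regs


x = [0x01234567, 0x89ABCDEF, 0x0F1E2D3C, 0x4B5A6978]
rk = [0x11111111, 0x22222222, 0x33333333, 0x44444444, 0x55555555, 0x66666666]
assert lea_round_shared(x, rk) == lea_round(x, rk)
```

Processing the pairs from $(X_2, X_3)$ down to $(X_0, X_1)$ means each pair sits in the same two registers when its turn comes, so one adder and two XORs suffice.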




4.1.2. LEA-128-AREA-2 

Figure 6 presents another version of the size-optimized LEA-128 implementation. Compared to
LEA-128-AREA-1, this version reduces the clock cycles required per round from seven to four. The
most significant difference from the previous version is that it supplies the scheduled key $RK$
on the fly. To achieve this, keys are inserted into the registers in the order $T_1$, $T_3$, $T_2$,
and $T_0$. Since $RK_1$ is always used during a round, it is scheduled first and stored in the
$t_0$ register. Next, $T_3$ in the $t_1$ register is scheduled, and the value from RMUX is directly
supplied to the XOR operation of the round function. The remaining keys are scheduled and used for
the round function in the same way. Since $RK_1$ moves through registers $t_0$, $t_2$, and $t_3$ as
the clock cycles advance, RKMUX is used to select the register that holds $RK_1$. Since keys are
not scheduled in increasing order as in LEA-128-AREA-1, the constant generating logic in Figure 4
cannot be used; the logic in Figure 3 is used instead. In this implementation, one round is carried
out in 4 clock cycles, and $4 \times 24 = 96$ cycles are required for encryption.

Figure 6. Datapath of LEA-128-AREA-2. 

4.1.3. LEA-128-SPEED

Figure 7 shows the data path of LEA-128-SPEED. As seen in the figure, all the operation logic
required for a round is arranged for parallel processing in order to execute a round in one clock
cycle. The plaintext registers have a MUX for selecting input from outside or from the internal
state. Likewise, the key registers have a MUX for choosing a key from outside or from among the
scheduled keys. The constant generation logic in Figure 2 is used. This implementation encrypts a
128-bit plaintext in 24 clock cycles.

4.1.4. LEA-192-AREA-1

Figure 8 presents the data path of LEA-192-AREA-1. In the 192-bit version of LEA, six 32-bit keys
are supplied and six 32-bit constants are used. Unlike LEA-128, which uses $T_1$ repeatedly,
LEA-192 uses each of the round keys $T_0$ to $T_5$ once in a round. Therefore, a simpler
implementation than that of LEA-128 is possible.






Figure 7. Datapath of LEA-128-SPEED. 

Figure 8. Datapath of LEA-192-AREA-1.

The round function is the same as that used by LEA-128, but the key schedule logic differs. First,
the key input sequence differs from that of LEA-128-AREA-1: keys $T_5$ to $T_0$ are scheduled one
by one. According to Equation (3), two round keys are used for a round function step. To use the
scheduled key on the fly, one of the keys is scheduled in advance and stored in the $t_0$ register,
and is then used as the input of one XOR of the round function. Next, the other key is scheduled
and supplied to the other XOR. Since $(X_2, X_3)$ is used first for calculation, $(T_5, T_4)$
should be supplied first. This also changes the constant generation logic. Since constants are used
in the order $\mathrm{ROL}_{i+5}(\delta_{i \bmod 6})$ down to $\mathrm{ROL}_i(\delta_{i \bmod 6})$,
$\mathrm{ROL}_6(c_1)$ is moved to the $c_0$ register at the start of a round. The value is then
rotated to the right in every clock. Therefore, $\mathrm{ROL}_1(c_0)$ is moved to $c_5$ when the
value in $c_0$ moves to $c_5$ at the beginning of a round. For the same reason, register $c_0$ is
initialized with $\mathrm{ROL}_5(\delta_0)$. This design processes one round in 6 clock cycles, and
thus $6 \times 28 = 168$ clock cycles are required to encrypt a 128-bit message.

4.1.5. LEA-192-AREA-2 

Figure 9 shows the data path of LEA-192-AREA-2, which is a faster version of LEA-192-AREA-1.
This implementation schedules two keys per clock cycle. The sequence of the key input is the same
as that used by LEA-192-AREA-1. However, there is a small difference in the constant generation
logic. To generate two constants simultaneously, rotation logic is attached to the $c_0$ register.
Further, one more adder is added, and KMUX is divided into two MUXes. The generated round keys are
directly supplied to the two XORs in the round function. In this implementation, three clock cycles
are needed to process a round, and thus $3 \times 28 = 84$ clock cycles are needed to encrypt a
plaintext block.

Figure 9. Datapath of LEA-192-AREA-2. 

4.1.6. LEA-192-SPEED

LEA-192-SPEED in Figure 10 has the same structure as LEA-128-SPEED, except that it has more
registers for keys and constants. It requires 28 clock cycles to encrypt a plaintext block.

Figure 10. Datapath of LEA-192-SPEED.

4.1.7. LEA-256-AREA-1

Figure 11 presents the structure of LEA-256-AREA-1. As seen in the area-opt hardware structures
of LEA-128 and LEA-192, both use the same hardware structure for the round function, in which the
plaintext (state) blocks are fed to the shared operation logic in $(X_2, X_3)$, $(X_1, X_2)$, and
$(X_0, X_1)$ order. If this order is changed or reversed, the structure may become complex. For
that reason, we also used this structure for LEA-256. In this case, the round keys are fed to the
operation logic in $(RK_4, RK_5)$, $(RK_2, RK_3)$, and $(RK_0, RK_1)$ order. However, given
Equation (4), the key scheduling shown for LEA-256 in Figure 11 may be the simplest way. This
structure schedules keys in $T_0$ to $T_5$ order, and the next key generation then starts from
$T_6$ and finishes at $T_3$. Therefore, scheduled keys must first be stored in registers and only
then used for the round function. That is, LEA-256 is not suitable for on-the-fly key generation.
The round keys are generated during six clock cycles, stored in registers $t_2$ to $t_7$, and then
used for the round function. This requires 9 clock cycles per round, and $9 \times 32 = 288$ clock
cycles are needed in all to encrypt a plaintext block.



4.1.8. LEA-256-AREA-2 



Figure 12 shows another version of area-optimized LEA-256 hardware. This version is similar to
LEA-192-AREA-2 in that it schedules two round keys per clock. As with LEA-128-AREA-1, round keys
cannot be used on the fly: each round key is scheduled once, stored in a register, and then used
for the round function. This halves the time for scheduling the round keys compared with
LEA-256-AREA-1, so one round is processed in 6 clock cycles and $6 \times 32 = 192$ clock cycles
are required to encrypt a message block.

Figure 11. Datapath of LEA-256-AREA-1. 






Figure 12. Datapath of LEA-256-AREA-2. 




4.1.9. LEA-256-SPEED 

The structure of LEA-256-SPEED is depicted in Figure 13. LEA-256 schedules six of its eight key
words in each round; the remaining two, followed by the next four, are used for the next round key
generation. Therefore, the values in $t_0$ to $t_5$ are scheduled and stored in the registers
$t_{n+2}$, while the unused values in $t_6$ and $t_7$ are moved to $t_0$ and $t_1$, respectively.
This implementation requires 32 clock cycles to encrypt a 128-bit plaintext.

Figure 13. Datapath of LEA-256-SPEED. 
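The cycle counts quoted throughout this section all follow from rounds multiplied by cycles per round. The short Python sketch below collects them. The per-round cycle counts are taken from the text above, while the helper names and the throughput formula (128-bit blocks, I/O cycles ignored) are illustrative assumptions, so the throughput values need not match the synthesis tables.

```python
ROUNDS = {128: 24, 192: 28, 256: 32}

# Clock cycles per round for each design, as stated in Section 4.
CYCLES_PER_ROUND = {
    "LEA-128-AREA-1": 7, "LEA-128-AREA-2": 4, "LEA-128-SPEED": 1,
    "LEA-192-AREA-1": 6, "LEA-192-AREA-2": 3, "LEA-192-SPEED": 1,
    "LEA-256-AREA-1": 9, "LEA-256-AREA-2": 6, "LEA-256-SPEED": 1,
}


def cycles_per_block(name):
    """Cycles to encrypt one 128-bit block, excluding input and output."""
    key_bits = int(name.split("-")[1])
    return ROUNDS[key_bits] * CYCLES_PER_ROUND[name]


def throughput_mbps(name, f_mhz):
    """Idealized throughput in Mbit/s at a clock frequency of f_mhz MHz."""
    return 128 * f_mhz / cycles_per_block(name)


for name in CYCLES_PER_ROUND:
    print(f"{name}: {cycles_per_block(name)} cycles, "
          f"{throughput_mbps(name, 100):.1f} Mbit/s at 100 MHz")
```

The 100 MHz figure is used only because it is the target frequency of the ASIC synthesis in Section 5.2.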

5. Implementation Results

5.1. FPGA

All of the designs described in Section 4 were implemented at the Register Transfer Level (RTL) in
Verilog. We present the FPGA synthesis results for well-known chips: the Xilinx Virtex 5 series and
the Altera Cyclone-III series. The Xilinx series was synthesized using ISE 13.4, while the Altera
series was synthesized using Quartus II 11.1sp2.

The implementation results for the Xilinx Virtex 5 chip are summarized in Table 2. The number of
slice elements is counted before being packed into slices. The speed-optimized versions had a
lower ATP (area-time product) and a higher throughput per area than the area-optimized versions.
This implies that although the area-optimized implementations remove replicated XOR and adder
logic, the amount of logic saved is small. Compared to LEA-128-AREA-1, the size of LEA-128-SPEED
is increased by 70%, but the number of cycles is decreased by a factor of 7. On the other hand,
LEA-128-AREA-2 has a lower operating frequency than LEA-128-AREA-1. An analysis of this
phenomenon reveals that, in LEA-128-AREA-2, the path from $c_0$ to $x_0$ is the critical path,
i.e., the longest path in the implementation. In contrast to LEA-128-AREA-1, LEA-128-AREA-2 has one
additional MUX gate in the path from $c_0$ to $x_0$, which makes the path longer. LEA-128-AREA-1,
LEA-256-AREA-1, and LEA-256-AREA-2, which store the scheduled keys in registers, do not require the
path from $c_0$ to $x_0$. Consequently, their critical paths are shorter and their operating
frequencies are higher. Figure 14 shows the throughput and area normalized to those of
LEA-128-AREA-1.
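A rough worked check of the LEA-128 comparison above, under the simplifying assumption (ours, not a measured result) that both designs run at the same clock frequency, so that throughput scales inversely with cycles per block:

$$
\frac{(\text{throughput}/\text{area})_{\text{LEA-128-SPEED}}}{(\text{throughput}/\text{area})_{\text{LEA-128-AREA-1}}}
\approx \frac{168 / 24}{1.7} = \frac{7}{1.7} \approx 4.1
$$

The measured ratio depends on the operating frequency each design actually achieves, so Table 2 remains the reference for exact values.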



Table 2. Comparison of implementation results using Xilinx Virtex 5. 

Figure 14. Generalized throughput and area graph to compare relative performance (Xilinx
Virtex-5).

Table 3 shows the implementation results for Altera Cyclone-III. The overall characteristics of the
implementation are similar to those for Xilinx. Figure 15 shows the throughput and size normalized
to those of LEA-128-AREA-1, from which the relative implementation results can be read.



Table 3. Comparison of implementation results using Altera Cyclone-III. 

Figure 15. Generalized throughput and area graph to compare relative performance (Altera
Cyclone-III).

5.2. ASIC

We also applied the same RTL code to implement the design as an ASIC using Synopsys's Design
Compiler B-2008-09.SP5 and the UMC 0.13 µm tech library. The maximum target frequency was
100 MHz, and all the designs met the timing constraints.

Table 4 compares the ASIC implementation results. As in the FPGA implementation case, the
speed-opt implementations are not much bigger than the area-opt implementations: the areas of the
speed-opt versions are larger by about 30%-40%. On the other hand, the throughputs of the speed-opt
implementations are much higher than those of the area-opt ones, resulting in a lower ATP and a
higher throughput per area. Among versions with the same key length, there is no significant
difference in sequential logic size, since they require a similar number of flip-flops. However,
the size of the combinational logic is larger in the speed-opt versions.

Table 4. Comparison of ASIC implementation results. (UMC 0.13 µm, Target frequency:
100 MHz).

Figure 16. Generalized throughput and area graph to compare relative performance (ASIC).

Figure 16 shows the throughput and size normalized to those of LEA-128-AREA-1, from which the
relative implementation results can be read.

6. Comparison 

Table 5 compares the ASIC implementation results of LEA with those of other existing encryption
algorithms. First of all, the area of LEA-128-SPEED is larger than that of the other
implementations, which is one disadvantage of our implementation. However, the throughput of
LEA-128-SPEED is higher than that of the other implementations because of its low number of cycles
per block. Even though HummingBird2 requires fewer cycles per block, the block size of
LEA-128-SPEED is much larger. Due to the high throughput, the throughput per area is higher than
that of the other implementations except PRESENT and HummingBird2. Although the throughput per
area of LEA-128-SPEED is not the best, it is similar to that of PRESENT and HummingBird2, which are
known to be efficient. For high-speed applications, where throughput matters most, LEA-128-SPEED
would be preferable to both. Even though LEA targets high software performance, its hardware
implementation results also compare well with other hardware implementations.



Table 5. Comparison to other encryption algorithms. 



7. Conclusions 

In this paper, we proposed hardware designs and implementations of a new lightweight encryption
algorithm, LEA. LEA uses the same round function irrespective of key size, but the key scheduling
differs between key sizes, so we presented hardware designs suited to each key size. For the
area-optimized versions, we presented a resource-shared structure. Furthermore, by applying
on-the-fly key scheduling or scheduling two keys simultaneously, it is possible to reduce the
number of clock cycles. For the speed-optimized versions, we parallelized all operations required
for a round and thereby achieved high throughput. After presenting the hardware structures, we also
presented the synthesis results of our designs. We implemented the designs in Verilog HDL and
synthesized them for FPGA chips and an ASIC, targeting commonly used FPGA chips and the open
library for ASIC. From the implementation results, we observed that the area-opt versions save
little area compared to the speed-opt versions. This is because the structure of LEA is so simple
that little can be saved by sharing components. Therefore, the speed-opt versions show better
throughput per area than the area-opt versions: the area savings of the area-opt versions are
small, while their speed is significantly lower. Compared to other results, our implementation is
not the best in throughput per area; however, it ranks high, and it is the best in throughput. We
leave further improvements of these designs to future work.